PetFinder Adoption Analysis
Chapter 1: Introduction
This dataset came from Kaggle (https://www.kaggle.com/c/petfinder-adoption-prediction/data) and contains information about pet profiles that were listed for adoption on PetFinder.
Chapter 2: Data description and cleaning
Initially this dataset contained 14993 observations of 24 variables with an ordinal dependent variable of AdoptionSpeed.
We took a few initial steps to clean this dataset. First, we kept only profiles that list a single pet (Quantity = 1), removing any profile with more than one pet to reduce the chance of a confounding variable. We then removed any pets with a categorical adoption speed of 4, because that meant the animal had not been adopted. We decided to keep 8 independent variables for analysis: Type, Age, Gender, MaturitySize, FurLength, Vaccinated, PhotoAmt, and VideoAmt. The final pre-processing step was to convert the dependent variable AdoptionSpeed to a numerical variable, ASnum. We did this by generating uniform random integers within the day interval that the dataset specifies for each AdoptionSpeed bucket. The resulting dependent variable ranges from 0 to 90 days listed on PetFinder.
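The two filtering steps can be sketched in a few lines of Python. This is an illustration only, not the code used for the report: the rows are made up, and the column names `PetID`, `Quantity`, and `AdoptionSpeed` are assumed to match the Kaggle file.

```python
# Made-up example rows; column names are assumed to follow the Kaggle data.
profiles = [
    {"PetID": "a", "Quantity": 1, "AdoptionSpeed": 2},
    {"PetID": "b", "Quantity": 3, "AdoptionSpeed": 1},  # group listing: dropped
    {"PetID": "c", "Quantity": 1, "AdoptionSpeed": 4},  # not adopted: dropped
    {"PetID": "d", "Quantity": 1, "AdoptionSpeed": 0},
]


def clean(rows):
    """Keep single-pet profiles whose pet was adopted (AdoptionSpeed 0 to 3)."""
    return [r for r in rows if r["Quantity"] == 1 and r["AdoptionSpeed"] != 4]


cleaned = clean(profiles)
print([r["PetID"] for r in cleaned])
```

Only the single-pet, adopted profiles survive the filter, which mirrors how level 4 ends up with a count of zero in the cleaned data.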
Our final dataset looks like this at a glance:
##  Type     Age              Gender   MaturitySize FurLength Vaccinated
##  1:4754   Min.   :  0.0    1:3837   1:1890       1:4776    1:3498
##  2:3731   1st Qu.:  2.0    2:4648   2:5816       2:3109    2:3939
##           Median :  3.0    3:   0   3: 751       3: 600    3:1048
##           Mean   : 10.3             4:  28
##           3rd Qu.:  9.0
##           Max.   :212.0
##     PhotoAmt        VideoAmt     AdoptionSpeed     ASnum
##  Min.   : 0.00   Min.   :0.00   0: 331        Min.   : 0.0
##  1st Qu.: 2.00   1st Qu.:0.00   1:2439        1st Qu.: 6.0
##  Median : 3.00   Median :0.00   2:3163        Median :18.0
##  Mean   : 3.81   Mean   :0.06   3:2552        Mean   :26.5
##  3rd Qu.: 5.00   3rd Qu.:0.00   4:   0        3rd Qu.:41.0
##  Max.   :30.00   Max.   :6.00                 Max.   :90.0
It contains 8485 observations of 10 variables, including both AdoptionSpeed and ASnum. Here is a look at the first 3 rows of the cleaned dataset.
##   Type Age Gender MaturitySize FurLength Vaccinated PhotoAmt VideoAmt
## 1    2   3      1            1         1          2        1        0
## 2    2   1      1            2         2          3        2        0
## 3    1   1      1            2         2          1        7        0
##   AdoptionSpeed ASnum
## 1             2    12
## 2             0     0
## 3             3    81
To make sense of the numeric codes, we use the metadata from the dataset description on Kaggle to decode the categorical values.
AdoptionSpeed - Categorical speed of adoption. Lower is faster.
Type - Type of animal (1 = Dog, 2 = Cat)
Age - Age of pet when listed, in months
Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
PhotoAmt - Total uploaded photos for this pet
VideoAmt - Total uploaded videos for this pet
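As an illustration of how these codes can be decoded, the following Python sketch maps each categorical code to its label using the metadata listed above. The `LABELS` dictionary and the `decode` helper are hypothetical names introduced here; the example row is the first row of the cleaned dataset shown earlier.

```python
# Code-to-label mappings taken from the metadata list above.
LABELS = {
    "Type": {1: "Dog", 2: "Cat"},
    "Gender": {1: "Male", 2: "Female", 3: "Mixed"},
    "MaturitySize": {0: "Not Specified", 1: "Small", 2: "Medium",
                     3: "Large", 4: "Extra Large"},
    "FurLength": {0: "Not Specified", 1: "Short", 2: "Medium", 3: "Long"},
    "Vaccinated": {1: "Yes", 2: "No", 3: "Not Sure"},
}


def decode(row):
    """Replace categorical codes with labels; leave numeric columns as-is."""
    decoded = {}
    for column, value in row.items():
        if column in LABELS:
            decoded[column] = LABELS[column].get(value, value)
        else:
            decoded[column] = value
    return decoded


# First row of the cleaned dataset.
first_row = {"Type": 2, "Age": 3, "Gender": 1, "MaturitySize": 1,
             "FurLength": 1, "Vaccinated": 2, "PhotoAmt": 1, "VideoAmt": 0}
decoded_row = decode(first_row)
print(decoded_row)
```

Read this way, the first listing is a three-month-old, small, short-haired male cat that has not been vaccinated and has one photo.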
Chapter 3: Independent Variables EDA
Numerical variable
In this chapter we introduce the variables. The first one is AdoptionSpeed, the dependent variable.
AdoptionSpeed has five levels (0 to 4):
0 Pet was adopted on the same day as it was listed.
1 Pet was adopted between 1 and 7 days (1st week) after being listed.
2 Pet was adopted between 8 and 30 days (1st month) after being listed.
3 Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
4 No adoption after 100 days of being listed (we drop this level).
We transformed AdoptionSpeed into a numerical variable, ASnum.
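The transformation to ASnum can be sketched as follows: each retained level is mapped to its day interval, and a uniform random integer is drawn from that interval. This is a minimal Python illustration under those assumptions; the names `DAY_RANGES` and `to_days` and the seed are hypothetical, and the exact draws in the report will differ.

```python
import random

# Day interval for each retained AdoptionSpeed level (level 4 is dropped).
DAY_RANGES = {0: (0, 0), 1: (1, 7), 2: (8, 30), 3: (31, 90)}


def to_days(adoption_speed, rng):
    """Draw a uniform random integer number of days within the level's interval."""
    low, high = DAY_RANGES[adoption_speed]
    return rng.randint(low, high)  # randint is inclusive on both ends


rng = random.Random(42)  # arbitrary seed, only to make the sketch repeatable
speeds = [2, 0, 3, 1]
as_num = [to_days(s, rng) for s in speeds]
print(list(zip(speeds, as_num)))
```

Level 0 always maps to 0 days, while the other levels are spread across their intervals, which is why ASnum ranges from 0 to 90.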
This is Age. Although the variable is named Age, it records the pet’s age in months at the time it was listed on PetFinder. Most of the pets are clearly young. One plausible explanation is that owners cannot care for a whole litter of puppies or kittens and list them soon after birth.
This is PhotoAmt, the number of photos included in a profile. Most listed pets have only a few pictures.
Categorical variable
Let us now focus on the categorical variables.
The first one is Type. There are two types: dog and cat.
The second one is Gender. After cleaning, there are two genders: male and female.
The third one is MaturitySize. There are four levels: small, medium, large, and extra large.
The fourth one is FurLength. There are three levels: short, medium, and long.
The fifth one is Vaccinated. There are three levels: vaccinated, unvaccinated, and not sure.
Chapter 4: Independent Variables EDA (Variance, T-Test, ANOVA)
Given all the variables we’ve just seen, we started asking questions about what impacted the adoption speed of individual animals on PetFinder. We started to answer this question by looking at the categorical variables we chose to isolate in our dataset. These variables are:
Type: Dogs and Cats
Gender: Male and Female
Size: Small, Medium, Large, and Extra Large
Fur Length: Short, Medium, and long
Vaccinated: Vaccinated, Unvaccinated, and Unspecified
SMART: Do dogs get adopted faster than cats?
First, we looked at animal Type. In order to answer this question, we split the data into two groups, dogs and cats.
Looking at this graph, it appeared that dogs were getting adopted more slowly than cats. Because there were only two groups, we decided to use a two-sample t-test to evaluate whether there was a significant difference between the average adoption speed of dogs and cats. First, however, we had to check the homogeneity of variance. To do this we ran a Breusch-Pagan (BP) test.
H0: The two groups have the same variance
H1: The groups have different variances
The result of this BP test was:
\(p = 7.309 \times 10^{-6}\)
Because the p-value is less than our alpha of 0.05, we reject the null hypothesis that the groups have the same variance. The equal-variance assumption of the standard two-sample t-test is therefore not met, so its result should be read with caution; we went ahead and performed the test regardless.
H0: The mean adoption speed of dogs and cats are equal
H1: The mean adoption speed of dogs and cats are different
The p-value of the t-test was less than our alpha of 0.05 (\(p = 1.403 \times 10^{-21}\)), allowing us to reject the null hypothesis that the two group means are equal. We can therefore conclude that there is a difference in adoption speed between cats and dogs.
If we look at the mean estimates for the two groups:
| Type | Dogs | Cats |
|---|---|---|
| Avg Days | 28.8 | 23.5 |
We can see that the adoption speed for dogs is higher (slower) than the mean adoption speed for cats, therefore we can conclude that dogs get adopted slower than cats, and that animal type does have an effect on adoption speed.
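Because the variance check rejected equal variances, one commonly used alternative is Welch's t-test, which does not pool the two variances. The sketch below computes Welch's t statistic and its approximate degrees of freedom in plain Python. It is an illustration only: the two samples are made-up numbers, not PetFinder data, and it is not necessarily the test variant used for the results above.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Per-group variance of the mean (sample variance divided by group size).
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Made-up days-to-adoption values for two groups with unequal spread.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 3))
```

The p-value would then come from a t distribution with `dof` degrees of freedom, which requires a statistics library and is left out of this sketch.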
SMART: Do physical attributes affect adoption speed?
The next question we wanted to answer was what physical attributes affect adoption speed. To answer this question we looked at the gender of the animal, the size, the fur length, and the vaccination status. Beginning with the gender of the animal, we compared male animals to female animals.
From this plot we can see that female animals appear to be adopted more slowly than male animals. Before drawing any conclusions, we ran a variance check.
H0: The two groups have the same variance
H1: The groups have different variances
Our result was:
p = 0.001
We were able to reject the null hypothesis that the variance was equal between the groups. Again, this meant that the equal-variance assumption of the t-test was not met; we ran a two-sample t-test regardless.
H0: The mean adoption speed of male and female animals are equal
H1: The mean adoption speed of male and female animals are different
We found that this was also significant (\(p = 3.023 \times 10^{-8}\)), so we can reject the null hypothesis that the mean adoption speed of male and female animals is the same.
Looking at the estimates:
| Gender | Male | Female |
|---|---|---|
| Avg Days | 24.8 | 27.9 |
Because the mean adoption time of male animals (24.8 days) was lower than that of female animals (27.9 days), we can conclude that male animals get adopted more quickly than female animals.
Next we wanted to look at the size of the animals.
We plotted both the boxplot and the density plot to emphasize the similarities in the distribution shapes. The extra large animals appear to have the only different distribution; however, it’s evident from the boxplot that there are very few data points for extra large animals (28 of 8485). To determine whether there are any differences in adoption speed, we decided to run an ANOVA.
First we ran a variance check.
H0: The adoption speed variance is the same across different sizes
H1: The adoption speed variance is not the same across different sizes
Our result was:
p = 0.032
Once again we got a significant p-value, so equal variance is rejected, but we decided to run the ANOVA regardless.
H0: The mean adoption speed is the same for all sizes
H1: The mean adoption speed is not the same for all sizes
We ended up with a significant p-value for the ANOVA (\(p = 6.879 \times 10^{-13}\)), so we ran a TukeyHSD post-hoc test.
We found that the mean adoption speeds differed significantly between small and medium animals and between medium and large animals. We can therefore conclude that animal size does impact adoption speed.
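For readers who want to see what the ANOVA compares, the following Python sketch computes the one-way ANOVA F statistic as the ratio of between-group to within-group mean squares. It is a minimal illustration with made-up groups, not the PetFinder data, and it stops at the F statistic rather than computing a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up days-to-adoption values for three size groups.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(round(f_stat, 2), df_b, df_w)
```

A large F means the group means differ by more than the spread within groups would explain, which is what triggers the post-hoc comparison of individual pairs.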
Next we looked at fur length.
Again, it’s clear that the sample sizes are imbalanced, with few long-haired animals. The density plots show the similarities in the adoption speed distributions. It appears that the short-haired animals have a more pronounced peak around 25 days on PetFinder. To analyze these differences we decided to use an ANOVA.
First we ran a variance check.
H0: The adoption speed variance is the same across different fur lengths
H1: The adoption speed variance is not the same across different fur lengths
Our result was:
p = 0.087
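This time the p-value is above our alpha of 0.05, so we fail to reject the null hypothesis of equal variance across fur lengths, and the ANOVA assumption is not contradicted. We continued with our analysis.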
H0: The mean adoption speed is the same for all fur lengths
H1: The mean adoption speed is not the same for all fur lengths
| Pair | diff | lwr | upr | p adj |
|---|---|---|---|---|
| 2-1 | -2.38 | -3.76 | -1.00 | 2e-04 |
| 3-1 | -6.46 | -9.05 | -3.86 | 0e+00 |
| 3-2 | -4.07 | -6.74 | -1.41 | 1e-03 |
Our ANOVA did produce a significant result (\(p = 6.557 \times 10^{-10}\)), allowing us to reject the null hypothesis that the adoption speed was the same across fur lengths. Because we got a significant result, we again ran a TukeyHSD post-hoc analysis and found that the difference between each pair of fur length groups was significant. We’re able to conclude that fur length does impact adoption speed.
Finally, we looked at vaccination status.
Again, the density plots show a very similar distribution across the different vaccination statuses. To evaluate whether there is a difference in the mean values, we decided to use an ANOVA.
First we ran a variance check.
H0: The adoption speed variance is the same across different vaccination statuses
H1: The adoption speed variance is not the same across different vaccination statuses
Our result was:
\(p = 8.325 \times 10^{-5}\)
Once again the BP test rejected equal variance. However, we continued with our analysis.
H0: The mean adoption speed is the same for all vaccination statuses
H1: The mean adoption speed is not the same for all vaccination statuses
| Pair | diff | lwr | upr | p adj |
|---|---|---|---|---|
| 2-1 | -5.26 | -6.652 | -3.8770 | 0.0000 |
| 3-1 | -2.19 | -4.298 | -0.0912 | 0.0385 |
| 3-2 | 3.07 | 0.994 | 5.1460 | 0.0015 |
Our ANOVA did produce a significant result (\(p = 6.253 \times 10^{-18}\)). The post-hoc analysis shows that all three pairwise differences are significant at the 0.05 level: vaccinated vs. unvaccinated (2-1), unvaccinated vs. not sure (3-2), and, more marginally, vaccinated vs. not sure (3-1, adjusted p = 0.0385). We can conclude that vaccination status did impact the adoption speed of the animal.
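One way to read a TukeyHSD table is to check whether each 95% interval (lwr, upr) excludes zero, which corresponds to an adjusted p-value below 0.05. The Python sketch below applies that check to the vaccination rows from the table above; the helper name `excludes_zero` is introduced here for illustration.

```python
# (pair, diff, lwr, upr, p_adj) copied from the vaccination post-hoc table.
# Levels: 1 = vaccinated, 2 = unvaccinated, 3 = not sure.
tukey_rows = [
    ("2-1", -5.26, -6.652, -3.8770, 0.0000),
    ("3-1", -2.19, -4.298, -0.0912, 0.0385),
    ("3-2", 3.07, 0.994, 5.1460, 0.0015),
]


def excludes_zero(lwr, upr):
    """True when the confidence interval lies entirely on one side of zero."""
    return lwr > 0 or upr < 0


significant = {
    pair: excludes_zero(lwr, upr) for pair, _diff, lwr, upr, _p in tukey_rows
}
for pair, flag in significant.items():
    print(pair, "interval excludes zero" if flag else "interval contains zero")
```

The 3-1 interval only just misses zero (upper bound -0.0912), which matches its adjusted p-value of 0.0385 being the closest to the 0.05 threshold.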
The final conclusion from our analysis of the categorical characteristics is that the difference in variance across groups, found in every comparison except fur length, means we cannot strictly draw conclusions from these tests. However, if we set that assumption aside for this sample of data, we are able to conclude that all of the categorical characteristics, including the type of animal, do impact the adoption speed of the animal.
Chapter 5: Linear Modeling
SMART: What numerical variables influence adoption speed?
We will run single variable OLS regressions for the three independent numerical variables. First, age.
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 26 | 25, 26 | <0.001 |
| Age | 0.06 | 0.03, 0.09 | <0.001 |

CI = Confidence Interval
Clearly, age is statistically significant, as the p-value is below 0.05.
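For reference, a single-variable OLS fit has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. The Python sketch below shows this on made-up numbers chosen to lie exactly on a line; it is not the report's data or code, and `fit_line` is a name introduced here.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data lying exactly on y = 1 + 2x, so the fit is easy to verify.
ages = [1, 2, 3, 4]
days = [3, 5, 7, 9]
intercept, slope = fit_line(ages, days)
print(intercept, slope)
```

Read the same way, the rounded coefficients in the Age table say that each additional month of age is associated with roughly 0.06 more days on PetFinder, so a 12-month-old pet is predicted at about 26 + 0.06 × 12 = 26.72 days.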
Let’s look at VideoAmt next.
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 26 | 26, 27 | <0.001 |
| VideoAmt | 1.6 | -0.05, 3.3 | 0.057 |

CI = Confidence Interval
Finally, let’s check PhotoAmt.
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 23 | 23, 24 | <0.001 |
| PhotoAmt | 0.81 | 0.64, 1.0 | <0.001 |

CI = Confidence Interval
As we can see, only Age and PhotoAmt result in a p-value less than 0.05; VideoAmt (p = 0.057) falls just short.
Answer to SMART question 3: Age and PhotoAmt are the only two numerical variables that are statistically significant on their own.
SMART: What variables, both categorical and numerical, result in the best predictive model?
We use feature selection tools to identify the best model, starting from an OLS model that includes all the variables we are considering.
## Reordering variables and trying again:
We now create plots for all four selection methods we want to compare: exhaustive, forward, backward, and sequential replacement.
Exhaustive, forward, and backward selection all chose the same variables: Age, Gender2, MaturitySize2, MaturitySize3, FurLength2, FurLength3, Vaccinated2, and PhotoAmt.
Sequential replacement recommended: Gender2, MaturitySize2, FurLength2, FurLength3, Vaccinated2, and PhotoAmt.
The only difference between the two models is that sequential replacement drops Age and MaturitySize3.
| Characteristic | Beta | 95% CI | p-value | GVIF |
|---|---|---|---|---|
| (Intercept) | 21 | 19, 23 | <0.001 | |
| Age | 0.08 | 0.05, 0.12 | <0.001 | 1.2 |
| Gender | | | | 1.0 |
| 1 | — | — | | |
| 2 | 3.3 | 2.1, 4.4 | <0.001 | |
| MaturitySize | | | | 1.1 |
| 1 | — | — | | |
| 2 | 4.5 | 3.1, 5.9 | <0.001 | |
| 3 | 0.79 | -1.6, 3.1 | 0.5 | |
| FurLength | | | | 1.1 |
| 1 | — | — | | |
| 2 | -2.8 | -4.0, -1.6 | <0.001 | |
| 3 | -6.7 | -9.2, -4.3 | <0.001 | |
| Vaccinated | | | | 1.1 |
| 1 | — | — | | |
| 2 | -4.6 | -5.8, -3.4 | <0.001 | |
| PhotoAmt | 0.82 | 0.65, 1.0 | <0.001 | 1.0 |

CI = Confidence Interval, GVIF = Generalized Variance Inflation Factor
This model gives an adjusted \(R^2\) value of 0.037.
| Characteristic | Beta | 95% CI | p-value | GVIF |
|---|---|---|---|---|
| (Intercept) | 23 | 21, 25 | <0.001 | |
| Gender | | | | 1.0 |
| 1 | — | — | | |
| 2 | 3.4 | 2.2, 4.6 | <0.001 | |
| MaturitySize | | | | 1.0 |
| 1 | — | — | | |
| 2 | 4.2 | 2.8, 5.6 | <0.001 | |
| FurLength | | | | 1.0 |
| 1 | — | — | | |
| 2 | -2.8 | -4.1, -1.5 | <0.001 | |
| 3 | -6.3 | -9.0, -3.6 | <0.001 | |
| Vaccinated | | | | 1.0 |
| 1 | — | — | | |
| 2 | -5.7 | -6.9, -4.5 | <0.001 | |
| PhotoAmt | 0.74 | 0.56, 0.92 | <0.001 | 1.0 |

CI = Confidence Interval, GVIF = Generalized Variance Inflation Factor
This model gives an adjusted \(R^2\) value of 0.0343.
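Adjusted \(R^2\) penalizes the ordinary \(R^2\) for the number of predictors, which is what makes it suitable for comparing models of different sizes. A minimal Python sketch of the standard formula follows; the input values are made up for illustration, since the unadjusted \(R^2\) of the two models is not reported here.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Made-up example: R^2 of 0.5 from 101 observations and 4 predictors.
example_adj = adjusted_r2(0.5, 101, 4)
print(round(example_adj, 4))
```

With thousands of observations and only a handful of predictors, as in this dataset, the penalty is small, so the adjusted and unadjusted values would be close.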
As the exhaustive/forward/backward model has a higher adjusted \(R^2\) (0.037) than the sequential replacement model (0.0343), we conclude that the following model is the best of those considered and the answer to our last SMART question, although both models explain less than 4% of the variance in ASnum.
ASnum ~ Age + Gender2 + MaturitySize2 + MaturitySize3 + FurLength2 + FurLength3 + Vaccinated2 + PhotoAmt
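To show how this model would be used, the Python sketch below turns the rounded coefficients from the exhaustive/forward/backward table into a prediction function. It is an illustration only: the coefficients are the rounded table values, the function name `predict_days` is introduced here, and factor levels without a reported coefficient are treated as the reference level, which is a simplifying assumption.

```python
# Rounded coefficients from the exhaustive/forward/backward model table.
COEF = {
    "Intercept": 21, "Age": 0.08, "PhotoAmt": 0.82,
    "Gender2": 3.3,
    "MaturitySize2": 4.5, "MaturitySize3": 0.79,
    "FurLength2": -2.8, "FurLength3": -6.7,
    "Vaccinated2": -4.6,
}


def predict_days(age, gender, maturity_size, fur_length, vaccinated, photo_amt):
    """Predicted ASnum (days listed) from the rounded coefficients."""
    y = COEF["Intercept"] + COEF["Age"] * age + COEF["PhotoAmt"] * photo_amt
    # Level 1 is the reference for each factor. Levels with no reported
    # coefficient (e.g. MaturitySize 4, Vaccinated 3) add 0 in this sketch.
    factors = (("Gender", gender), ("MaturitySize", maturity_size),
               ("FurLength", fur_length), ("Vaccinated", vaccinated))
    for name, level in factors:
        y += COEF.get(f"{name}{level}", 0.0)
    return y


# A 3-month-old, medium-sized, short-haired, unvaccinated female with 3 photos.
example = predict_days(age=3, gender=2, maturity_size=2,
                       fur_length=1, vaccinated=2, photo_amt=3)
print(round(example, 1))
```

Given the low adjusted \(R^2\), such a prediction should be treated as a rough average for similar listings rather than a forecast for an individual pet.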
Conclusion
Categorical Variables
Type of animal, gender, size, fur length, and vaccination status all impact our dependent variable of adoption speed. However, the variances of most of these groupings are not equal, so any relationship found between these variables and adoption speed should be interpreted with caution.
Linear Model
Of the numerical independent variables, only Age and PhotoAmt are statistically significant predictors of ASnum; VideoAmt is not (p = 0.057).
The best model: ASnum ~ Age + Gender2 + MaturitySize2 + MaturitySize3 + FurLength2 + FurLength3 + Vaccinated2 + PhotoAmt.